{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Feature Selection in a sklearn pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook is quite similar to [the first example](./01%20Feature%20Extraction%20and%20Selection.ipynb).\n",
    "This time however, we use the `sklearn` pipeline API of `tsfresh`.\n",
    "If you want to learn more, have a look at [the documentation](https://tsfresh.readthedocs.io/en/latest/text/sklearn_transformers.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.metrics import classification_report\n",
    "\n",
    "from tsfresh.examples import load_robot_execution_failures\n",
    "from tsfresh.transformers import RelevantFeatureAugmenter\n",
    "from tsfresh.utilities.dataframe_functions import impute"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load and Prepare the Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check out the first example notebook to learn more about the data and format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tsfresh.examples.robot_execution_failures import download_robot_execution_failures\n",
    "download_robot_execution_failures() \n",
    "df_ts, y = load_robot_execution_failures()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We want to use the extracted features to predict for each of the robot executions, if it was a failure or not.\n",
    "Therefore our basic \"entity\" is a single robot execution given by a distinct `id`.\n",
    "\n",
    "A dataframe with these identifiers as index needs to be prepared for the pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = pd.DataFrame(index=y.index)\n",
    "\n",
    "# Split data into train and test set\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Build the pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We build a sklearn pipeline that consists of a feature extraction step (`RelevantFeatureAugmenter`) with a subsequent `RandomForestClassifier`.\n",
    "\n",
    "The `RelevantFeatureAugmenter` takes roughly the same arguments as `extract_features` and `select_features` do."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ppl = Pipeline([\n",
    "        ('augmenter', RelevantFeatureAugmenter(column_id='id', column_sort='time')),\n",
    "        ('classifier', RandomForestClassifier())\n",
    "      ])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-warning\">\n",
    "    \n",
    "Here comes the tricky part!\n",
    "    \n",
    "The input to the pipeline will be our dataframe `X`, which one row per identifier.\n",
    "It is currently empty.\n",
    "But which time series data should the `RelevantFeatureAugmenter` to actually extract the features from?\n",
    "\n",
    "We need to pass the time series data (stored in `df_ts`) to the transformer.\n",
    "    \n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this case, df_ts contains the time series of both train and test set, if you have different dataframes for \n",
    "train and test set, you have to call set_params two times \n",
    "(see further below on how to deal with two independent data sets)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ppl.set_params(augmenter__timeseries_container=df_ts);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are now ready to fit the pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ppl.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The augmenter has used the input time series data to extract time series features for each of the identifiers in the `X_train` and selected only the relevant ones using the passed `y_train` as target.\n",
    "These features have been added to `X_train` as new columns.\n",
    "The classifier can now use these features during trainings."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prediction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "During interference, the augmentor does only extract the relevant features it has found out in the training phase and the classifier predicts the target using these features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_pred = ppl.predict(X_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So, finally we inspect the performance:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(classification_report(y_test, y_pred))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also find out, which columns the augmenter has selected"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ppl.named_steps[\"augmenter\"].feature_selector.relevant_features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "    \n",
    "In this example we passed in an empty (except the index) `X_train` or `X_test` into the pipeline.\n",
    "However, you can also fill the input with other features you have (e.g. features extracted from the metadata)\n",
    "or even use other pipeline components before.\n",
    "    \n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Separating the time series data containers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the example above we passed in a single `df_ts` into the `RelevantFeatureAugmenter`, which was used both for training and predicting.\n",
    "During training, only the data with the `id`s from `X_train` where extracted and during prediction the rest.\n",
    "\n",
    "However, it is perfectly fine to call `set_params` twice: once before training and once before prediction. \n",
    "This can be handy if you for example dump the trained pipeline to disk and re-use it only later for prediction.\n",
    "You only need to make sure that the `id`s of the enteties you use during training/prediction are actually present in the passed time series data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_ts_train = df_ts[df_ts[\"id\"].isin(y_train.index)]\n",
    "df_ts_test = df_ts[df_ts[\"id\"].isin(y_test.index)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ppl.set_params(augmenter__timeseries_container=df_ts_train);\n",
    "ppl.fit(X_train, y_train);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pickle\n",
    "with open(\"pipeline.pkl\", \"wb\") as f:\n",
    "    pickle.dump(ppl, f)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Later: load the fitted model and do predictions on new, unseen data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pickle\n",
    "with open(\"pipeline.pkl\", \"rb\") as f:\n",
    "    ppk = pickle.load(f)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ppl.set_params(augmenter__timeseries_container=df_ts_test);\n",
    "y_pred = ppl.predict(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(classification_report(y_test, y_pred))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}